Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering (Appendix)

Neural Information Processing Systems

We chose the Google Search corpus [Luo et al., 2021] for our question-answering system as it provides good coverage of the required knowledge and is publicly available. Even so, it is advised to conduct an ethical review before deploying the system in a live service. Table 1 shows the data statistics of the OK-VQA dataset. We build a DPR retriever as a baseline for FLMR. Inner-product search (supported by FAISS [Johnson et al., 2019]) is used in training. In answer generation, we use t5-large and Salesforce/blip2-flan-t5-xl. (Equal contribution as first author. 37th Conference on Neural Information Processing Systems, NeurIPS 2023.)
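The inner-product search that FAISS provides at scale can be sketched in plain NumPy. This is an illustrative toy, not the paper's configuration: the embedding dimension, corpus size, and function name `top_k_inner_product` are assumptions made for the example.

```python
import numpy as np

def top_k_inner_product(query, doc_embeddings, k=5):
    """Return indices and scores of the k documents with the highest
    inner-product similarity to the query -- the operation that
    FAISS's exact inner-product index performs, here in plain NumPy
    for illustration only."""
    scores = doc_embeddings @ query          # (num_docs,) similarity scores
    top = np.argsort(-scores)[:k]            # indices of highest scores first
    return top, scores[top]

# Toy corpus: 1000 random 128-dim document embeddings (illustrative sizes).
rng = np.random.default_rng(0)
docs = rng.standard_normal((1000, 128)).astype("float32")

# A query that is a lightly perturbed copy of document 42,
# so that document should rank first.
q = docs[42] + 0.01 * rng.standard_normal(128).astype("float32")

ids, scores = top_k_inner_product(q, docs, k=5)
```

In practice FAISS replaces the brute-force matrix product with optimized (and optionally approximate) index structures, but the ranking semantics are the same.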


Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering

Neural Information Processing Systems

Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to utilize knowledge from external knowledge bases to answer visually-grounded questions. Retrieval-Augmented Visual Question Answering (RA-VQA), a strong framework to tackle KB-VQA, first retrieves related documents with Dense Passage Retrieval (DPR) and then uses them to answer questions. This paper proposes Fine-grained Late-interaction Multi-modal Retrieval (FLMR) which significantly improves knowledge retrieval in RA-VQA. FLMR addresses two major limitations in RA-VQA's retriever: (1) the image representations obtained via image-to-text transforms can be incomplete and inaccurate, and (2) similarity scores between queries and documents are computed with one-dimensional embeddings, which can be insensitive to finer-grained similarities. FLMR overcomes these limitations by obtaining image representations that complement those from the image-to-text transform using a vision model aligned with an existing text-based retriever through a simple alignment network. FLMR also encodes images and questions using multi-dimensional embeddings to capture finer-grained similarities between queries and documents.
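The contrast the abstract draws between one-dimensional and multi-dimensional embeddings can be sketched as follows. This is a minimal NumPy illustration of ColBERT-style late-interaction (MaxSim) scoring, the general technique FLMR builds on, not FLMR's actual implementation; all names, shapes, and the pooling choice for the single-vector baseline are assumptions made for the example.

```python
import numpy as np

def late_interaction_score(query_tokens, doc_tokens):
    """Late-interaction scoring: each query token embedding is matched
    to its most similar document token embedding (MaxSim), and the
    per-token maxima are summed. Shapes:
      query_tokens: (num_q_tokens, dim)
      doc_tokens:   (num_d_tokens, dim)"""
    sim = query_tokens @ doc_tokens.T        # (num_q, num_d) token-pair similarities
    return float(sim.max(axis=1).sum())      # max over doc tokens, sum over query tokens

def single_vector_score(query_vec, doc_vec):
    """Baseline: a single dot product between pooled one-dimensional
    embeddings, which discards token-level structure."""
    return float(query_vec @ doc_vec)

# Toy token embeddings (illustrative sizes: 8 query tokens, 30 doc tokens, dim 64).
rng = np.random.default_rng(1)
q_tokens = rng.standard_normal((8, 64))
d_tokens = rng.standard_normal((30, 64))

fine = late_interaction_score(q_tokens, d_tokens)
# Mean-pooling is one common way to collapse token embeddings into a
# single vector (an assumption here, not the paper's method).
coarse = single_vector_score(q_tokens.mean(axis=0), d_tokens.mean(axis=0))
```

The key design point is that the query-document score is computed only after encoding ("late" interaction), so document token embeddings can still be precomputed and indexed, while the MaxSim step preserves per-token matching that a single pooled dot product averages away.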

